Assignment Overview

This is the markdownfile for my assignment 8 on cluster analysis. For this assignment I expanded on the code I wrote for the cluster analysis class project adding more methods for distance calculation and visualization. Once again, I used cancer gene expression and survival data to show clustering applications to analyze genetic data on two types brain cancers: Low Grade Glioma (Glioma) and Glioblastoma. Only genes that had a p-value < 0.05 relating gene expression to survival were included in the analysis. Shared genes significant to survival in both brain cancers was subsetted and then random samples of 50 and 100 genes were chosen to create the corresponding glioma and glioblastoma dataframes for clustering analysis. Methods used include hierarchiacal clustering using Average Linkage and Ward(ANOVA) and partition clustering using K-means and PAM. I also explored visualization using heatmaps with correlation-based distance measures including “pearson”, “kendall”, and “spearman” methods.

Datasets Used

Gene expression data is based on The Cancer Genome Atlast (TCGA) found on the National Cancer Institute’s Genomic Data Portal (https://gdc-portal.nci.nih.gov/). Survival data was derived from Stanford’s Precog database (https://precog.stanford.edu/), and Oncolnc (http://www.oncolnc.org/). The data was reformatted to suit clusting analysis.

precogdata <- read_excel("/Users/louisecabansay/Downloads/Precog_MetaZ.xlsx")
gbm.oncolnc <- read_excel("/Users/louisecabansay/Documents/School Work/UCBX R Class/GBM_mrna.xlsx")
gli.oncolnc <- read_excel("/Users/louisecabansay/Documents/School Work/UCBX R Class/LowerGradeGlioma.xlsx")
gbm.precog <-  data.frame(precogdata$Gene, precogdata$Name, precogdata$`Unweighted_meta-Z_of_all_cancers`, precogdata$Brain_cancer_Glioblastoma)
gli.precog <-  data.frame(precogdata$Gene, precogdata$Name, precogdata$`Unweighted_meta-Z_of_all_cancers`, precogdata$Brain_cancer_Glioma)
#examine data
names(gbm.oncolnc)[1] <- "GeneName"
names(gli.oncolnc)[1] <- "GeneName"
names(gbm.precog)[1] <-"GeneName"
names(gli.precog)[1] <-"GeneName"
str(gbm.precog)
## 'data.frame':    23287 obs. of  4 variables:
##  $ GeneName                                     : Factor w/ 23287 levels "42795","42796",..: 1570 5930 21090 21165 2212 14283 2585 2559 2481 21681 ...
##  $ precogdata.Name                              : Factor w/ 19740 levels "1-acylglycerol-3-phosphate O-acyltransferase 1 (lysophosphatidic acid acyltransferase, alpha)",..: 1343 5333 17355 17380 3178 11431 2319 2295 3244 17255 ...
##  $ precogdata..Unweighted_meta.Z_of_all_cancers.: num  12.2 12.1 11.9 11.7 11.4 ...
##  $ precogdata.Brain_cancer_Glioblastoma         : num  0.296 -0.359 -0.124 -0.742 -0.427 ...
str(gbm.oncolnc)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16840 obs. of  6 variables:
##  $ GeneName           : chr  "PTPRN" "ZIC3" "PTPRN2" "OS9" ...
##  $ Cox coefficient    : num  0.523 -0.382 0.377 0.394 0.38 ...
##  $ Raw p-value        : num  2.6e-06 4.2e-05 9.2e-05 9.5e-05 1.0e-04 2.6e-04 3.4e-04 3.6e-04 3.7e-04 3.9e-04 ...
##  $ BH-adjusted p-value: num  0.0438 0.3368 0.3368 0.3368 0.3368 ...
##  $ Median Expression  : num  485.8 20.3 1351.7 6624.2 101.9 ...
##  $ Mean Expression    : num  807 29.2 1547.9 14460.4 168 ...
str(gli.oncolnc)
## Classes 'tbl_df', 'tbl' and 'data.frame':    16822 obs. of  6 variables:
##  $ GeneName           : chr  "TGIF1" "CRTAC1" "CEP112" "ZDHHC22" ...
##  $ Cox_coefficient    : num  0.876 -0.842 0.802 -0.802 -0.77 ...
##  $ Raw_p-value        : num  2.2e-16 7.8e-16 3.9e-15 1.9e-14 2.6e-14 ...
##  $ BH-adjusted_p-value: num  3.70e-12 6.56e-12 2.19e-11 7.99e-11 8.75e-11 ...
##  $ Median_Expression  : num  253.5 1030.5 67.2 2556.8 21002 ...
##  $ Mean_Expression    : num  348.9 1430 97.3 3083.7 22508.3 ...
describe(gbm.precog)
##                                               vars     n     mean      sd
## GeneName*                                        1 23287 11644.00 6722.52
## precogdata.Name*                                 2 23261  9054.87 5650.82
## precogdata..Unweighted_meta.Z_of_all_cancers.    3 23287     0.10    2.16
## precogdata.Brain_cancer_Glioblastoma             4 23287    -0.03    1.04
##                                                 median  trimmed     mad
## GeneName*                                     11644.00 11644.00 8631.70
## precogdata.Name*                               8525.00  8851.28 7512.33
## precogdata..Unweighted_meta.Z_of_all_cancers.     0.05     0.05    1.34
## precogdata.Brain_cancer_Glioblastoma              0.00    -0.04    0.75
##                                                  min      max    range
## GeneName*                                       1.00 23287.00 23286.00
## precogdata.Name*                                1.00 19740.00 19739.00
## precogdata..Unweighted_meta.Z_of_all_cancers. -10.66    12.20    22.86
## precogdata.Brain_cancer_Glioblastoma           -5.40     4.69    10.09
##                                               skew kurtosis    se
## GeneName*                                     0.00    -1.20 44.05
## precogdata.Name*                              0.26    -1.19 37.05
## precogdata..Unweighted_meta.Z_of_all_cancers. 0.41     2.79  0.01
## precogdata.Brain_cancer_Glioblastoma          0.00     1.35  0.01
describe(gbm.oncolnc)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
##                     vars     n    mean      sd median trimmed    mad   min
## GeneName*              1 16840     NaN      NA     NA     NaN     NA   Inf
## Cox coefficient        2 16840    0.02    0.10   0.02    0.02   0.10 -0.38
## Raw p-value            3 16840    0.47    0.29   0.46    0.47   0.39  0.00
## BH-adjusted p-value    4 16840    0.90    0.08   0.91    0.90   0.08  0.04
## Median Expression      5 16840  994.94 3632.94 383.02  543.55 527.58  1.00
## Mean Expression        6 16840 1127.28 4233.28 436.56  609.75 576.47  1.21
##                           max     range  skew kurtosis    se
## GeneName*                -Inf      -Inf    NA       NA    NA
## Cox coefficient          0.52      0.91  0.09     0.02  0.00
## Raw p-value              1.00      1.00  0.10    -1.21  0.00
## BH-adjusted p-value      1.00      0.96 -1.02     2.14  0.00
## Median Expression   264903.35 264902.35 35.54  2058.42 28.00
## Mean Expression     309535.63 309534.43 35.76  2078.26 32.62
describe(gli.oncolnc)
## Warning: NAs introduced by coercion
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning
## Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning
## -Inf
##                     vars     n    mean      sd median trimmed    mad   min
## GeneName*              1 16822     NaN      NA     NA     NaN     NA   Inf
## Cox_coefficient        2 16822    0.03    0.28   0.03    0.03   0.30 -0.84
## Raw_p-value            3 16822    0.19    0.28   0.03    0.14   0.05  0.00
## BH-adjusted_p-value    4 16822    0.23    0.30   0.06    0.18   0.09  0.00
## Median_Expression      5 16822  997.63 3721.54 357.67  541.35 500.10  1.00
## Mean_Expression        6 16822 1114.84 4513.27 416.07  608.95 562.08  1.26
##                           max     range  skew kurtosis    se
## GeneName*                -Inf      -Inf    NA       NA    NA
## Cox_coefficient          0.88      1.72 -0.07    -0.53  0.00
## Raw_p-value              1.00      1.00  1.41     0.74  0.00
## BH-adjusted_p-value      1.00      1.00  1.14    -0.07  0.00
## Median_Expression   339528.01 339527.01 50.55  4188.38 28.69
## Mean_Expression     436000.93 435999.67 58.65  5246.81 34.80

Master Glioma and Glioblastoma dataframes

Combine different datasets from databases. Subset to contain only genes with significant p-values

gbm.df <- merge(gbm.precog, gbm.oncolnc, by = "GeneName")
gli.df <- merge(gli.precog, gli.oncolnc, by = "GeneName")
head(gbm.df)
##   GeneName                                    precogdata.Name
## 1     A1BG                             Alpha-1-B glycoprotein
## 2      A2M                              Alpha-2-macroglobulin
## 3    A2ML1                       Alpha-2-macroglobulin-like 1
## 4   A4GALT                    Alpha 1,4-galactosyltransferase
## 5     AAAS Achalasia, adrenocortical insufficiency, alacrimia
## 6     AACS                         Acetoacetyl-CoA synthetase
##   precogdata..Unweighted_meta.Z_of_all_cancers.
## 1                                      2.185478
## 2                                     -3.867906
## 3                                      1.682499
## 4                                     -1.959646
## 5                                      2.034914
## 6                                      3.167680
##   precogdata.Brain_cancer_Glioblastoma Cox coefficient Raw p-value
## 1                                0.559        -0.04370      0.6200
## 2                                0.931         0.09770      0.3200
## 3                                2.327        -0.03230      0.7200
## 4                                0.491         0.06420      0.4800
## 5                                1.500        -0.00354      0.9700
## 6                                0.755         0.26010      0.0081
##   BH-adjusted p-value Median Expression Mean Expression
## 1           0.9463247          162.3237        222.3394
## 2           0.8832650        18716.6609      22187.0128
## 3           0.9638156           43.9443        100.6319
## 4           0.9180239           82.3080        109.3797
## 5           0.9931177          830.5955        863.0091
## 6           0.6987552          450.6009        467.0564
head(gli.df)
##   GeneName                                    precogdata.Name
## 1     A1BG                             Alpha-1-B glycoprotein
## 2      A2M                              Alpha-2-macroglobulin
## 3    A2ML1                       Alpha-2-macroglobulin-like 1
## 4   A4GALT                    Alpha 1,4-galactosyltransferase
## 5     AAAS Achalasia, adrenocortical insufficiency, alacrimia
## 6     AACS                         Acetoacetyl-CoA synthetase
##   precogdata..Unweighted_meta.Z_of_all_cancers.
## 1                                      2.185478
## 2                                     -3.867906
## 3                                      1.682499
## 4                                     -1.959646
## 5                                      2.034914
## 6                                      3.167680
##   precogdata.Brain_cancer_Glioma Cox_coefficient Raw_p-value
## 1                          2.768          0.3334     0.00042
## 2                          3.145          0.2621     0.00540
## 3                          0.266          0.1944     0.03400
## 4                          2.310          0.0716     0.43000
## 5                         -0.505         -0.1889     0.04900
## 6                         -0.385         -0.0298     0.74000
##   BH-adjusted_p-value Median_Expression Mean_Expression
## 1         0.001630565          81.26865        99.26091
## 2         0.014291819       10952.64130     13228.99171
## 3         0.066910154         124.46575       180.39007
## 4         0.531793854          81.99415       105.68930
## 5         0.091322624         753.00550       766.96885
## 6         0.803322148         394.17055       491.00150
names(gbm.df)[2:4] <- c("GeneFunction", "Unweighted_meta_Z_allcancers", "precog_Z")
names(gli.df)[2:4] <- c("GeneFunction", "Unweighted_meta_Z_allcancers", "precog_Z")
sig.gbm<-unique(gbm.df[gbm.df$`Raw p-value` <0.05,])
sig.gli<-(gli.df[gli.df$`Raw_p-value`<0.05,])
gli.genes<-unique(sig.gli[sig.gli$GeneName %in% sig.gbm$GeneName,])
gbm.genes<-unique(sig.gbm[sig.gbm$GeneName %in% gli.genes$GeneName,])
remove <- "4748"
gli.genes <- gli.genes[!rownames(gli.genes) %in% remove,]

Standardize Values

standardize num variables to a mean of 0 with a standard deviation of 1 for comparison

gbm.scaled <- scale(gbm.genes[3:9])
gli.scaled <- scale(gli.genes[3:9])
glioblastoma<-cbind(gbm.genes[1:2], gbm.scaled)
glioma<-cbind(gli.genes[1:2], gli.scaled)
names(glioblastoma)[6:7] <-c("pvalue", "bh_pvalue")
names(glioma)[6:7] <-c("pvalue", "bh_pvalue")
describe(glioblastoma)
##                              vars   n     mean      sd   median  trimmed
## GeneName*                       1 512 12181.14 7721.54 13111.50 12227.34
## GeneFunction*                   2 512 10317.76 6227.18 10402.00 10346.91
## Unweighted_meta_Z_allcancers    3 512     0.00    1.00     0.01    -0.02
## precog_Z                        4 512     0.00    1.00    -0.03     0.01
## Cox coefficient                 5 512     0.00    1.00     0.51     0.08
## pvalue                          6 512     0.00    1.00     0.07     0.02
## bh_pvalue                       7 512     0.00    1.00     0.40     0.24
## Median Expression               8 512     0.00    1.00    -0.21    -0.16
## Mean Expression                 9 512     0.00    1.00    -0.19    -0.15
##                                   mad    min      max    range  skew
## GeneName*                    10734.02  76.00 23265.00 23189.00 -0.03
## GeneFunction*                 8478.25 187.00 19722.00 19535.00 -0.02
## Unweighted_meta_Z_allcancers     0.86  -3.47     4.05     7.52  0.32
## precog_Z                         1.01  -3.31     2.59     5.90 -0.13
## Cox coefficient                  0.28  -2.31     2.07     4.38 -0.79
## pvalue                           1.25  -1.79     1.65     3.45 -0.18
## bh_pvalue                        0.13  -9.76     0.61    10.36 -3.65
## Median Expression                0.09  -0.27    16.10    16.37 11.47
## Mean Expression                  0.09  -0.26    14.60    14.86 11.79
##                              kurtosis     se
## GeneName*                       -1.52 341.25
## GeneFunction*                   -1.40 275.21
## Unweighted_meta_Z_allcancers     1.25   0.04
## precog_Z                         0.04   0.04
## Cox coefficient                 -1.14   0.04
## pvalue                          -1.16   0.04
## bh_pvalue                       21.58   0.04
## Median Expression              158.63   0.04
## Mean Expression                158.32   0.04
describe(glioma)
##                              vars   n     mean      sd   median  trimmed
## GeneName*                       1 512 12181.14 7721.54 13111.50 12227.34
## GeneFunction*                   2 512 10317.76 6227.18 10402.00 10346.91
## Unweighted_meta_Z_allcancers    3 512     0.00    1.00     0.01    -0.02
## precog_Z                        4 512     0.00    1.00    -0.04     0.02
## Cox_coefficient                 5 512     0.00    1.00     0.44     0.05
## pvalue                          6 512     0.00    1.00    -0.53    -0.25
## bh_pvalue                       7 512     0.00    1.00    -0.54    -0.23
## Median_Expression               8 512     0.00    1.00    -0.24    -0.17
## Mean_Expression                 9 512     0.00    1.00    -0.25    -0.17
##                                   mad    min      max    range  skew
## GeneName*                    10734.02  76.00 23265.00 23189.00 -0.03
## GeneFunction*                 8478.25 187.00 19722.00 19535.00 -0.02
## Unweighted_meta_Z_allcancers     0.86  -3.47     4.05     7.52  0.32
## precog_Z                         1.15  -2.48     2.30     4.78 -0.18
## Cox_coefficient                  0.86  -2.17     1.58     3.75 -0.46
## pvalue                           0.07  -0.58     3.73     4.31  2.01
## bh_pvalue                        0.14  -0.63     3.46     4.09  1.78
## Median_Expression                0.12  -0.32    18.31    18.64 12.89
## Mean_Expression                  0.13  -0.35    17.92    18.26 12.18
##                              kurtosis     se
## GeneName*                       -1.52 341.25
## GeneFunction*                   -1.40 275.21
## Unweighted_meta_Z_allcancers     1.25   0.04
## precog_Z                        -0.68   0.04
## Cox_coefficient                 -1.31   0.04
## pvalue                           3.17   0.04
## bh_pvalue                        2.19   0.04
## Median_Expression              220.29   0.04
## Mean_Expression                201.85   0.04

Average Linkage Clustering using 50 genes

50 genes are randomly sampled from master list of glioblastoma genes (because it is the smaller set). The same 50 genes are used to cluster both glioma and glioblastoma. Sets are saved to csv file to reproduce results. In this markdownfile the actual sampling code is commented out and a previously generated set is loaded.

#Average-Linkage Clustering
#w/50 randomly sampled genes
#r50.glioblastoma <- glioblastoma[sample(1:nrow(glioblastoma), 50, replace=FALSE),]
#r50.glioma <- glioma[glioma$GeneName %in% r50.glioblastoma$GeneName,]

#save set to CSV to replicate later
#write.csv(r50.glioblastoma, "r50glioblastoma.csv")
#write.csv(r50.glioma, "r50glioma.csv")
r50.glioblastoma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r50glioblastoma.csv")
r50.glioma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r50glioma.csv")
row.names(r50.glioblastoma) <- r50.glioblastoma$GeneName
row.names(r50.glioma) <- r50.glioma$GeneName   
opar <- par(no.readonly=TRUE) #save current state

par(mfrow=c(2,1)) #layout dendogram hierarchy plots 2 rows, 1 column

#make glioblastoma cluster dendogram
nc.r50.glioblastoma <- r50.glioblastoma[3:9]                   
d.r50.glioblastoma <- dist(nc.r50.glioblastoma)                                          
fit.gbm.average50 <- hclust(d.r50.glioblastoma, method="average")                          
plot(fit.gbm.average50, hang=-1, cex=.8, main="50 gene Average Linkage Clustering for Glioblastoma")

#make glioma cluster dendogram
nc.r50.glioma <- r50.glioma[3:9]                           
d.r50.glioma <- dist(nc.r50.glioma)                                          
fit.gli.average50 <- hclust(d.r50.glioma, method="average")                          
plot(fit.gli.average50, hang=-1, cex=.8, main="50 gene Average Linkage Clustering for Glioma")

par(opar)

Selecting best number of clusters for both glioma50 and glioblastoma50

Shows best number of clusters to cluster each dataset using nbclust package. 5 clusters chosen for both cancers for comparability

nc.gli50 <- NbClust(nc.r50.glioma, distance="euclidean", 
                  min.nc=2, max.nc=10, method="average")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 2 proposed 2 as the best number of clusters 
## * 14 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
table(nc.gli50$Best.n[1,])
## 
##  0  1  2  3  4  6  7 10 
##  2  1  2 14  1  2  2  2
nc.gbm50 <- NbClust(nc.r50.glioblastoma, distance="euclidean", 
                    min.nc=2, max.nc=10, method="average")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 3 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 4 proposed 7 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 3 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
table(nc.gbm50$Best.n[1,])
## 
##  0  1  2  3  4  5  6  7  9 10 
##  2  1  6  2  2  3  2  4  1  3
par(opar)
par(mfrow=c(1,2)) #layout cluster bar plots next to each other plots 1rows, 2column
barplot(table(nc.gli50$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of GLI Clusters Chosen by 7 Criteria")

barplot(table(nc.gbm50$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of GBM Clusters Chosen by 7 Criteria") 

par(opar)

Create 5 Cluster dendogram for 50 glioma genes

gli.50.clusters <- cutree(fit.gli.average50, k=5) 
table(gli.50.clusters) #makes table showing how many genes in each cluster
## gli.50.clusters
##  1  2  3  4  5 
##  5 12 27  4  2
aggregate(nc.r50.glioma, by=list(cluster=gli.50.clusters), median) 
##   cluster Unweighted_meta_Z_allcancers   precog_Z Cox_coefficient
## 1       1                  -0.15391482  0.1825930       0.3300947
## 2       2                  -0.80796876 -1.3586532      -1.4624680
## 3       3                  -0.02566294  0.6188862       0.8056032
## 4       4                   2.26751030  1.4762832       0.8973953
## 5       5                  -1.32538543 -0.6004725      -0.2349430
##       pvalue  bh_pvalue Median_Expression Mean_Expression
## 1  0.9929519  1.1056295        -0.2763971      -0.3037636
## 2 -0.5782126 -0.6334471        -0.2689102      -0.2837911
## 3 -0.5743581 -0.6230121        -0.2907465      -0.3006326
## 4 -0.5741704 -0.6237840        -0.2687147      -0.2762618
## 5  2.9573277  2.8242766        -0.3024922      -0.3267051
aggregate(as.data.frame(nc.r50.glioma), by=list(cluster=gli.50.clusters),
          median)
##   cluster Unweighted_meta_Z_allcancers   precog_Z Cox_coefficient
## 1       1                  -0.15391482  0.1825930       0.3300947
## 2       2                  -0.80796876 -1.3586532      -1.4624680
## 3       3                  -0.02566294  0.6188862       0.8056032
## 4       4                   2.26751030  1.4762832       0.8973953
## 5       5                  -1.32538543 -0.6004725      -0.2349430
##       pvalue  bh_pvalue Median_Expression Mean_Expression
## 1  0.9929519  1.1056295        -0.2763971      -0.3037636
## 2 -0.5782126 -0.6334471        -0.2689102      -0.2837911
## 3 -0.5743581 -0.6230121        -0.2907465      -0.3006326
## 4 -0.5741704 -0.6237840        -0.2687147      -0.2762618
## 5  2.9573277  2.8242766        -0.3024922      -0.3267051
#plot clusters w/ red boxes indicating clusters
plot(fit.gli.average50, hang=-1, cex=.8,  
     main="GLI Average Linkage Clustering\n5 Cluster Solution") 
#overlay red rectangle on cluster
rect.hclust(fit.gli.average50, k=5)

par(opar)

Create 5 Cluster dendogram for 50 glioblastoma genes

gbm.50.clusters <- cutree(fit.gbm.average50, k=5) 
table(gbm.50.clusters)
## gbm.50.clusters
##  1  2  3  4  5 
##  7 27 12  3  1
aggregate(nc.r50.glioblastoma, by=list(cluster=gbm.50.clusters), median) 
##   cluster Unweighted_meta_Z_allcancers    precog_Z Cox.coefficient
## 1       1                   -0.6379634 -0.25487416       0.8907267
## 2       2                   -0.5070486 -0.27490099       0.4328913
## 3       3                    0.8711513  1.14736188       0.5451723
## 4       4                    0.3226721 -0.27490099      -1.8330551
## 5       5                   -2.1683410 -0.08750704       1.4187377
##       pvalue  bh_pvalue Median.Expression Mean.Expression
## 1 -1.4353588 -1.1645821        -0.2490808      -0.2217467
## 2  1.0214356  0.4871126        -0.2508916      -0.2365749
## 3  0.3174831  0.4012574        -0.2024280      -0.1849876
## 4 -1.7028608 -2.6239851        -0.2359394      -0.2260377
## 5 -1.7662165 -2.7880942        -0.2607908      -0.2423820
aggregate(as.data.frame(nc.r50.glioblastoma), by=list(cluster=gbm.50.clusters),
          median)
##   cluster Unweighted_meta_Z_allcancers    precog_Z Cox.coefficient
## 1       1                   -0.6379634 -0.25487416       0.8907267
## 2       2                   -0.5070486 -0.27490099       0.4328913
## 3       3                    0.8711513  1.14736188       0.5451723
## 4       4                    0.3226721 -0.27490099      -1.8330551
## 5       5                   -2.1683410 -0.08750704       1.4187377
##       pvalue  bh_pvalue Median.Expression Mean.Expression
## 1 -1.4353588 -1.1645821        -0.2490808      -0.2217467
## 2  1.0214356  0.4871126        -0.2508916      -0.2365749
## 3  0.3174831  0.4012574        -0.2024280      -0.1849876
## 4 -1.7028608 -2.6239851        -0.2359394      -0.2260377
## 5 -1.7662165 -2.7880942        -0.2607908      -0.2423820
plot(fit.gbm.average50, hang=-1, cex=.8,  
     main="GBM Average Linkage Clustering\n5 Cluster Solution")
rect.hclust(fit.gbm.average50, k=5)

par(opar)

Generate table with clustering information for 50 Glioma Genes

gli.50.clist <- lapply(sort(unique(gli.50.clusters)), function(x) r50.glioma[which(gli.50.clusters==x),])

gli.cluster.50.1 <- cbind(gli.50.clist[[1]], Cluster = "Gli.Cluster1")
gli.cluster.50.2 <- cbind(gli.50.clist[[2]], Cluster = "Gli.Cluster2")
gli.cluster.50.3 <- cbind(gli.50.clist[[3]], Cluster = "Gli.Cluster3")
gli.cluster.50.4 <- cbind(gli.50.clist[[4]], Cluster = "Gli.Cluster4")
gli.cluster.50.5 <- cbind(gli.50.clist[[5]], Cluster = "Gli.Cluster5")
#gli.cluster.50.6 <- cbind(gli.50.clist[[6]], Cluster = "Gli.Cluster6")
#gli.cluster.50.7 <- cbind(gli.50.clist[[7]], Cluster = "Gli.Cluster7")
#gli.cluster.50.8 <- cbind(gli.50.clist[[8]], Cluster = "Gli.Cluster8")
glioma_50_Genes_C <- rbind(gli.cluster.50.1, gli.cluster.50.2, gli.cluster.50.3, gli.cluster.50.4,
                                 gli.cluster.50.5)# gli.cluster.50.6,gli.cluster.50.7, gli.cluster.50.8)
#save table to csv
write.csv(glioma_50_Genes_C, "Table_Glioma_50genes_5Clust.csv")
glioma_50_Genes_C[,-c(1,3:9)]
##                                                                         GeneFunction
## AK7                                                               Adenylate kinase 7
## CUX1                                                             Cut-like homeobox 1
## DKFZP586I1420                                     Hypothetical protein DKFZp586I1420
## FSTL3                                     Follistatin-like 3 (secreted glycoprotein)
## SCN9A                          Sodium channel, voltage-gated, type IX, alpha subunit
## ANK1                                                         Ankyrin 1, erythrocytic
## ANO4                                                                     Anoctamin 4
## CYP2E1                         Cytochrome P450, family 2, subfamily E, polypeptide 1
## DLK2                                               Delta-like 2 homolog (Drosophila)
## EHD3                                                          EH-domain containing 3
## GRHPR                                 Glyoxylate reductase/hydroxypyruvate reductase
## HECW1                HECT, C2 and WW domain containing E3 ubiquitin protein ligase 1
## HMX1                                                            H6 family homeobox 1
## RIIAD1                                                                Data not found
## SGSM1                                          Small G protein signaling modulator 1
## SUGT1P3                 Suppressor of G2 allele of SKP1 (S. cerevisiae) pseudogene 3
## WNT7B                          Wingless-type MMTV integration site family, member 7B
## AQP9                                                                     Aquaporin 9
## B3GALNT2                                Beta-1,3-N-acetylgalactosaminyltransferase 2
## BIRC3                                            Baculoviral IAP repeat-containing 3
## CCDC69                                              Coiled-coil domain containing 69
## CPA2                                                Carboxypeptidase A2 (pancreatic)
## DENND2D                                               DENN/MADD domain containing 2D
## DHX15                                      DEAH (Asp-Glu-Ala-His) box polypeptide 15
## DUSP5                                                 Dual specificity phosphatase 5
## FAM50B                                  Family with sequence similarity 50, member B
## FAM86B2                                Family with sequence similarity 86, member B2
## FGR                    Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## G0S2                                                                   G0/G1switch 2
## GJB2                                             Gap junction protein, beta 2, 26kDa
## PHKB                                                      Phosphorylase kinase, beta
## PPARG                               Peroxisome proliferator-activated receptor gamma
## PRPF38A              PRP38 pre-mRNA processing factor 38 (yeast) domain containing A
## PTRF                                      Polymerase I and transcript release factor
## RASSF10           Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## S100PBP                                                        S100P binding protein
## SHISA5                                              Shisa homolog 5 (Xenopus laevis)
## VASN                                                                         Vasorin
## ZMYM1                                                        Zinc finger, MYM-type 1
## ZNF175                                                       Zinc finger protein 175
## ZNF432                                                       Zinc finger protein 432
## ZNF543                                                       Zinc finger protein 543
## ZNF644                                                       Zinc finger protein 644
## ZRANB2                                  Zinc finger, RAN-binding domain containing 2
## CKAP4                                              Cytoskeleton-associated protein 4
## OPN3                                                                         Opsin 3
## PTX3                                                               Pentraxin 3, long
## SLC16A3       Solute carrier family 16, member 3 (monocarboxylic acid transporter 4)
## DAPK2                                              Death-associated protein kinase 2
## PIP5KL1                             Phosphatidylinositol-4-phosphate 5-kinase-like 1
##                    Cluster
## AK7           Gli.Cluster1
## CUX1          Gli.Cluster1
## DKFZP586I1420 Gli.Cluster1
## FSTL3         Gli.Cluster1
## SCN9A         Gli.Cluster1
## ANK1          Gli.Cluster2
## ANO4          Gli.Cluster2
## CYP2E1        Gli.Cluster2
## DLK2          Gli.Cluster2
## EHD3          Gli.Cluster2
## GRHPR         Gli.Cluster2
## HECW1         Gli.Cluster2
## HMX1          Gli.Cluster2
## RIIAD1        Gli.Cluster2
## SGSM1         Gli.Cluster2
## SUGT1P3       Gli.Cluster2
## WNT7B         Gli.Cluster2
## AQP9          Gli.Cluster3
## B3GALNT2      Gli.Cluster3
## BIRC3         Gli.Cluster3
## CCDC69        Gli.Cluster3
## CPA2          Gli.Cluster3
## DENND2D       Gli.Cluster3
## DHX15         Gli.Cluster3
## DUSP5         Gli.Cluster3
## FAM50B        Gli.Cluster3
## FAM86B2       Gli.Cluster3
## FGR           Gli.Cluster3
## G0S2          Gli.Cluster3
## GJB2          Gli.Cluster3
## PHKB          Gli.Cluster3
## PPARG         Gli.Cluster3
## PRPF38A       Gli.Cluster3
## PTRF          Gli.Cluster3
## RASSF10       Gli.Cluster3
## S100PBP       Gli.Cluster3
## SHISA5        Gli.Cluster3
## VASN          Gli.Cluster3
## ZMYM1         Gli.Cluster3
## ZNF175        Gli.Cluster3
## ZNF432        Gli.Cluster3
## ZNF543        Gli.Cluster3
## ZNF644        Gli.Cluster3
## ZRANB2        Gli.Cluster3
## CKAP4         Gli.Cluster4
## OPN3          Gli.Cluster4
## PTX3          Gli.Cluster4
## SLC16A3       Gli.Cluster4
## DAPK2         Gli.Cluster5
## PIP5KL1       Gli.Cluster5

Generate table with clustering information for 50 Glioblastoma Genes

gbm.50.clist <- lapply(sort(unique(gbm.50.clusters)), function(x) r50.glioblastoma[which(gbm.50.clusters==x),])

gbm.cluster.50.1 <- cbind(gbm.50.clist[[1]], Cluster = "Gbm.Cluster1")
gbm.cluster.50.2 <- cbind(gbm.50.clist[[2]], Cluster = "Gbm.Cluster2")
gbm.cluster.50.3 <- cbind(gbm.50.clist[[3]], Cluster = "Gbm.Cluster3")
gbm.cluster.50.4 <- cbind(gbm.50.clist[[4]], Cluster = "Gbm.Cluster4")
gbm.cluster.50.5 <- cbind(gbm.50.clist[[5]], Cluster = "Gbm.Cluster5")
#gbm.cluster.50.6 <- cbind(gbm.50.clist[[6]], Cluster = "Gbm.Cluster6")
#gbm.cluster.50.7 <- cbind(gbm.50.clist[[7]], Cluster = "Gbm.Cluster7")
#gbm.cluster.50.8 <- cbind(gbm.50.clist[[8]], Cluster = "Gbm.Cluster8")
glioblastoma_50_Genes_C <- rbind(gbm.cluster.50.1, gbm.cluster.50.2, gbm.cluster.50.3, gbm.cluster.50.4,
                               gbm.cluster.50.5) #gbm.cluster.50.6,gbm.cluster.50.7, gbm.cluster.50.8)

#save cluster table results into csv
#write.csv(glioblastoma_50_Genes_C, "Table_Glioblastoma_50genes_5Clust.csv")

glioblastoma_50_Genes_C[,-c(1,3:9)]
##                                                                         GeneFunction
## SGSM1                                          Small G protein signaling modulator 1
## CUX1                                                             Cut-like homeobox 1
## OPN3                                                                         Opsin 3
## BIRC3                                            Baculoviral IAP repeat-containing 3
## DENND2D                                               DENN/MADD domain containing 2D
## ANO4                                                                     Anoctamin 4
## GJB2                                             Gap junction protein, beta 2, 26kDa
## DHX15                                      DEAH (Asp-Glu-Ala-His) box polypeptide 15
## PIP5KL1                             Phosphatidylinositol-4-phosphate 5-kinase-like 1
## RIIAD1                                                                Data not found
## ZMYM1                                                        Zinc finger, MYM-type 1
## ZNF175                                                       Zinc finger protein 175
## S100PBP                                                        S100P binding protein
## CPA2                                                Carboxypeptidase A2 (pancreatic)
## FAM86B2                                Family with sequence similarity 86, member B2
## PHKB                                                      Phosphorylase kinase, beta
## PRPF38A              PRP38 pre-mRNA processing factor 38 (yeast) domain containing A
## SCN9A                          Sodium channel, voltage-gated, type IX, alpha subunit
## EHD3                                                          EH-domain containing 3
## SHISA5                                              Shisa homolog 5 (Xenopus laevis)
## DLK2                                               Delta-like 2 homolog (Drosophila)
## WNT7B                          Wingless-type MMTV integration site family, member 7B
## HMX1                                                            H6 family homeobox 1
## ZNF543                                                       Zinc finger protein 543
## CYP2E1                         Cytochrome P450, family 2, subfamily E, polypeptide 1
## HECW1                HECT, C2 and WW domain containing E3 ubiquitin protein ligase 1
## DKFZP586I1420                                     Hypothetical protein DKFZp586I1420
## CCDC69                                              Coiled-coil domain containing 69
## ZNF644                                                       Zinc finger protein 644
## DAPK2                                              Death-associated protein kinase 2
## ZRANB2                                  Zinc finger, RAN-binding domain containing 2
## GRHPR                                 Glyoxylate reductase/hydroxypyruvate reductase
## AK7                                                               Adenylate kinase 7
## RASSF10           Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## VASN                                                                         Vasorin
## PPARG                               Peroxisome proliferator-activated receptor gamma
## CKAP4                                              Cytoskeleton-associated protein 4
## DUSP5                                                 Dual specificity phosphatase 5
## FAM50B                                  Family with sequence similarity 50, member B
## FGR                    Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## FSTL3                                     Follistatin-like 3 (secreted glycoprotein)
## PTX3                                                               Pentraxin 3, long
## AQP9                                                                     Aquaporin 9
## SLC16A3       Solute carrier family 16, member 3 (monocarboxylic acid transporter 4)
## G0S2                                                                   G0/G1switch 2
## PTRF                                      Polymerase I and transcript release factor
## ZNF432                                                       Zinc finger protein 432
## SUGT1P3                 Suppressor of G2 allele of SKP1 (S. cerevisiae) pseudogene 3
## B3GALNT2                                Beta-1,3-N-acetylgalactosaminyltransferase 2
## ANK1                                                         Ankyrin 1, erythrocytic
##                    Cluster
## SGSM1         Gbm.Cluster1
## CUX1          Gbm.Cluster1
## OPN3          Gbm.Cluster1
## BIRC3         Gbm.Cluster1
## DENND2D       Gbm.Cluster1
## ANO4          Gbm.Cluster1
## GJB2          Gbm.Cluster1
## DHX15         Gbm.Cluster2
## PIP5KL1       Gbm.Cluster2
## RIIAD1        Gbm.Cluster2
## ZMYM1         Gbm.Cluster2
## ZNF175        Gbm.Cluster2
## S100PBP       Gbm.Cluster2
## CPA2          Gbm.Cluster2
## FAM86B2       Gbm.Cluster2
## PHKB          Gbm.Cluster2
## PRPF38A       Gbm.Cluster2
## SCN9A         Gbm.Cluster2
## EHD3          Gbm.Cluster2
## SHISA5        Gbm.Cluster2
## DLK2          Gbm.Cluster2
## WNT7B         Gbm.Cluster2
## HMX1          Gbm.Cluster2
## ZNF543        Gbm.Cluster2
## CYP2E1        Gbm.Cluster2
## HECW1         Gbm.Cluster2
## DKFZP586I1420 Gbm.Cluster2
## CCDC69        Gbm.Cluster2
## ZNF644        Gbm.Cluster2
## DAPK2         Gbm.Cluster2
## ZRANB2        Gbm.Cluster2
## GRHPR         Gbm.Cluster2
## AK7           Gbm.Cluster2
## RASSF10       Gbm.Cluster2
## VASN          Gbm.Cluster3
## PPARG         Gbm.Cluster3
## CKAP4         Gbm.Cluster3
## DUSP5         Gbm.Cluster3
## FAM50B        Gbm.Cluster3
## FGR           Gbm.Cluster3
## FSTL3         Gbm.Cluster3
## PTX3          Gbm.Cluster3
## AQP9          Gbm.Cluster3
## SLC16A3       Gbm.Cluster3
## G0S2          Gbm.Cluster3
## PTRF          Gbm.Cluster3
## ZNF432        Gbm.Cluster4
## SUGT1P3       Gbm.Cluster4
## B3GALNT2      Gbm.Cluster4
## ANK1          Gbm.Cluster5

Hierarchical Clustering of 100 genes using Ward Method

For markdownfile a previously generated random gene list was loaded

#generate 100 random genes that are significant in both gli and gbm
#grabs from glioblastoma & glioma dfs (contains same genes)
#r100.glioblastoma <- glioblastoma[sample(1:nrow(glioblastoma), 100, replace=FALSE),]
#r100.glioma <- glioma[glioma$GeneName %in% r100.glioblastoma$GeneName,]

#save randomly generated gene df into csv for reproducibility 
#write.csv(r100.glioblastoma, "r100glioblastoma.csv")
#write.csv(r100.glioma, "r100glioma.csv")

r100.glioblastoma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r100glioblastoma.csv")
r100.glioma <-read.csv("/Users/louisecabansay/Dropbox (Personal)/R Projects/Clustering Project/r100glioma.csv")
row.names(r100.glioblastoma) <- r100.glioblastoma$GeneName
row.names(r100.glioma) <- r100.glioma$GeneName

par(opar)
nc.r100.glioblastoma <- r100.glioblastoma[3:9] 
d.r100.glioblastoma <- dist(nc.r100.glioblastoma)                                          
fit.gbm.ward100 <- hclust(d.r100.glioblastoma, method="ward")                          
plot(fit.gbm.ward100, hang=-1, cex=.8, main="100 Gene Ward Method Clustering: Glioblastoma")

nc.r100.glioma <- r100.glioma[3:9]
d.r100.glioma <- dist(nc.r100.glioma)                                          
fit.gli.ward100 <- hclust(d.r100.glioma, method="ward")                          
plot(fit.gli.ward100, hang=-1, cex=.8, main="100 Gene Ward Method Clustering: Glioma")

par(opar)

Best Ward Clusters for 100 genes

#find best number of clusters for 100 genes hierarchial clustering

#find for 100 gli genes
nc.gli.100 <- NbClust(nc.r100.glioma, distance="euclidean", 
              min.nc=2, max.nc=20, method="ward.D")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 1 proposed 2 as the best number of clusters 
## * 17 proposed 3 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## * 1 proposed 17 as the best number of clusters 
## * 3 proposed 20 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
table(nc.gli.100$Best.n[1,])
## 
##  0  1  2  3 10 17 20 
##  2  1  1 17  1  1  3
barplot(table(nc.gli.100$Best.n[1,]), 
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Ward Clusters Chosen by 7 Criteria: 100 GLI") 
par(opar)

#find for 100 gbm genes
nc.gbm.100 <- NbClust(nc.r100.glioblastoma, distance="euclidean", 
                  min.nc=2, max.nc=20, method="ward.D")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 16 as the best number of clusters 
## * 2 proposed 17 as the best number of clusters 
## * 1 proposed 18 as the best number of clusters 
## * 2 proposed 20 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
table(nc.gbm.100$Best.n[1,])
## 
##  0  1  2  3  4  5  6  7 16 17 18 20 
##  2  1  6  2  5  1  1  2  1  2  1  2
barplot(table(nc.gbm.100$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of Ward Clusters Chosen by 7 Criteria: 100 GBM") 
par(opar)

Glioma 100 gene Ward method with 10 clusters

#plot 100 gene glioma dendogram w/red boxes around 10 clusters
gli.100.clusters <- cutree(fit.gli.ward100, k=10) 
table(gli.100.clusters)
## gli.100.clusters
##  1  2  3  4  5  6  7  8  9 10 
## 19 11 18  7  6 12  6  5  8  8
aggregate(nc.r100.glioma, by=list(cluster=gli.100.clusters), median) 
##    cluster Unweighted_meta_Z_allcancers   precog_Z Cox_coefficient
## 1        1                  -0.33723915  0.4964010       0.6901913
## 2        2                  -0.06897342  0.3002306       0.4129763
## 3        3                   0.98065569  1.1576276       0.8579346
## 4        4                  -0.38238700 -0.2284922       0.2786119
## 5        5                  -0.37008446 -1.1458391      -0.9033706
## 6        6                  -0.61686806 -1.5449666      -1.4505873
## 7        7                  -1.49974619  0.4718393       0.8993754
## 8        8                   1.25698724  0.1825930       0.3300947
## 9        9                   0.16445403 -0.1657952      -1.4911795
## 10      10                   0.41196540  0.6825527       0.8172010
##        pvalue  bh_pvalue Median_Expression Mean_Expression
## 1  -0.5436265 -0.5653951       -0.26781673     -0.26493921
## 2   0.1810099  0.3120433       -0.26704794     -0.28712766
## 3  -0.5748383 -0.6246490       -0.24344459     -0.26957030
## 4   2.3898414  2.3416708       -0.24959806     -0.27818242
## 5   0.6000767  0.7290179       -0.09670977     -0.08369619
## 6  -0.5779551 -0.6325875        0.02690056      0.12284067
## 7  -0.5736728 -0.6228476       -0.15754173     -0.18220353
## 8   2.0406190  2.0450908       -0.25866463     -0.27201661
## 9  -0.5780162 -0.6327654       -0.30060345     -0.32510579
## 10 -0.5757986 -0.6268340        0.67110837      0.65836299
aggregate(as.data.frame(nc.r100.glioma), by=list(cluster=gli.100.clusters),
          median)
##    cluster Unweighted_meta_Z_allcancers   precog_Z Cox_coefficient
## 1        1                  -0.33723915  0.4964010       0.6901913
## 2        2                  -0.06897342  0.3002306       0.4129763
## 3        3                   0.98065569  1.1576276       0.8579346
## 4        4                  -0.38238700 -0.2284922       0.2786119
## 5        5                  -0.37008446 -1.1458391      -0.9033706
## 6        6                  -0.61686806 -1.5449666      -1.4505873
## 7        7                  -1.49974619  0.4718393       0.8993754
## 8        8                   1.25698724  0.1825930       0.3300947
## 9        9                   0.16445403 -0.1657952      -1.4911795
## 10      10                   0.41196540  0.6825527       0.8172010
##        pvalue  bh_pvalue Median_Expression Mean_Expression
## 1  -0.5436265 -0.5653951       -0.26781673     -0.26493921
## 2   0.1810099  0.3120433       -0.26704794     -0.28712766
## 3  -0.5748383 -0.6246490       -0.24344459     -0.26957030
## 4   2.3898414  2.3416708       -0.24959806     -0.27818242
## 5   0.6000767  0.7290179       -0.09670977     -0.08369619
## 6  -0.5779551 -0.6325875        0.02690056      0.12284067
## 7  -0.5736728 -0.6228476       -0.15754173     -0.18220353
## 8   2.0406190  2.0450908       -0.25866463     -0.27201661
## 9  -0.5780162 -0.6327654       -0.30060345     -0.32510579
## 10 -0.5757986 -0.6268340        0.67110837      0.65836299
plot(fit.gli.ward100, hang=-1, cex=.8,  
     main="Glioma 100 Gene Ward Method Clustering\n10 Cluster Solution")
#outline k clusters in red boxes on dendogram
rect.hclust(fit.gli.ward100, k=10)

Glioblastoma 100 gene Ward method with 10 clusters

par(opar)
#plot 100 gene glioblastoma dendogram w/redboxes around 10 clusters
gbm.100.clusters <- cutree(fit.gbm.ward100, k=10) 
table(gbm.100.clusters)
## gbm.100.clusters
##  1  2  3  4  5  6  7  8  9 10 
## 20 13  7 14  7  7 11  9  7  5
aggregate(nc.r100.glioblastoma, by=list(cluster=gbm.100.clusters), median) 
##    cluster Unweighted_meta_Z_allcancers   precog_Z Cox.coefficient
## 1        1                   0.44741374 -0.4007840      -1.3999022
## 2        2                  -0.01287706  0.1406558       0.7087541
## 3        3                   0.86335265  1.5668525       0.8096134
## 4        4                  -0.02079043 -0.2402117       0.5475921
## 5        5                  -0.63228624  0.6921090       0.4590257
## 6        6                   0.03562454 -0.4529968       0.9667099
## 7        7                  -0.99496397 -0.7083389       0.5437204
## 8        8                   1.13109766  1.1448585       0.5258135
## 9        9                  -0.66394158 -0.8184865      -1.8190200
## 10      10                  -0.64561638 -1.9292606      -1.4371679
##         pvalue  bh_pvalue Median.Expression Mean.Expression
## 1   0.45827358  0.4161397       -0.21235303     -0.20642396
## 2  -0.87923628  0.1080645        0.09196304      0.06829423
## 3  -1.09042205 -0.5376288       -0.15761511     -0.16384789
## 4   0.59906410  0.4737663       -0.24992118     -0.23029184
## 5   0.73985461  0.4835080       -0.19285941     -0.18598142
## 6  -1.60430742 -2.0615623       -0.22935145     -0.19490796
## 7  -0.03449321  0.4012574       -0.20762862     -0.20072240
## 8   0.38787833  0.4012574       -0.20532013     -0.18526386
## 9  -1.53391216 -1.5142691       -0.16881740     -0.17033129
## 10  0.66945935  0.4835080        0.10732303      0.04213735
aggregate(as.data.frame(nc.r100.glioblastoma), by=list(cluster=gbm.100.clusters),
          median)
##    cluster Unweighted_meta_Z_allcancers   precog_Z Cox.coefficient
## 1        1                   0.44741374 -0.4007840      -1.3999022
## 2        2                  -0.01287706  0.1406558       0.7087541
## 3        3                   0.86335265  1.5668525       0.8096134
## 4        4                  -0.02079043 -0.2402117       0.5475921
## 5        5                  -0.63228624  0.6921090       0.4590257
## 6        6                   0.03562454 -0.4529968       0.9667099
## 7        7                  -0.99496397 -0.7083389       0.5437204
## 8        8                   1.13109766  1.1448585       0.5258135
## 9        9                  -0.66394158 -0.8184865      -1.8190200
## 10      10                  -0.64561638 -1.9292606      -1.4371679
##         pvalue  bh_pvalue Median.Expression Mean.Expression
## 1   0.45827358  0.4161397       -0.21235303     -0.20642396
## 2  -0.87923628  0.1080645        0.09196304      0.06829423
## 3  -1.09042205 -0.5376288       -0.15761511     -0.16384789
## 4   0.59906410  0.4737663       -0.24992118     -0.23029184
## 5   0.73985461  0.4835080       -0.19285941     -0.18598142
## 6  -1.60430742 -2.0615623       -0.22935145     -0.19490796
## 7  -0.03449321  0.4012574       -0.20762862     -0.20072240
## 8   0.38787833  0.4012574       -0.20532013     -0.18526386
## 9  -1.53391216 -1.5142691       -0.16881740     -0.17033129
## 10  0.66945935  0.4835080        0.10732303      0.04213735
plot(fit.gbm.ward100, hang=-1, cex=.8,  
     main="Glioblastoma 100 Gene Ward Method Clustering\n10 Cluster Solution")
#outline k clusters in red boxes on dendogram
rect.hclust(fit.gbm.ward100, k=10)

#K-means cluster plot of 100 glioblastoma genes
km.100.glioblastoma <- kmeans(nc.r100.glioblastoma, 4, iter.max = 10, nstart=25) 
km.100.glioblastoma$size
## [1] 25 13 41 21
km.100.glioblastoma$centers                                               
##   Unweighted_meta_Z_allcancers    precog_Z Cox.coefficient     pvalue
## 1                   0.08758045 -0.78979091      -1.4192803  0.4948791
## 2                  -0.44739658 -0.71538132      -0.5587032 -1.5204288
## 3                  -0.32099130 -0.02533314       0.5321759  0.5132162
## 4                   0.79368490  1.00398950       0.7440931 -0.8745433
##    bh_pvalue Median.Expression Mean.Expression
## 1  0.4391912        0.01458218     -0.01341659
## 2 -1.7269062       -0.17738145     -0.17116306
## 3  0.4379154       -0.05137820     -0.05182371
## 4 -0.3718029        0.03146196      0.02840369
aggregate(nc.r100.glioblastoma, by=list(cluster=km.100.glioblastoma$cluster), mean)
##   cluster Unweighted_meta_Z_allcancers    precog_Z Cox.coefficient
## 1       1                   0.08758045 -0.78979091      -1.4192803
## 2       2                  -0.44739658 -0.71538132      -0.5587032
## 3       3                  -0.32099130 -0.02533314       0.5321759
## 4       4                   0.79368490  1.00398950       0.7440931
##       pvalue  bh_pvalue Median.Expression Mean.Expression
## 1  0.4948791  0.4391912        0.01458218     -0.01341659
## 2 -1.5204288 -1.7269062       -0.17738145     -0.17116306
## 3  0.5132162  0.4379154       -0.05137820     -0.05182371
## 4 -0.8745433 -0.3718029        0.03146196      0.02840369
clusplot(nc.r100.glioblastoma, km.100.glioblastoma$cluster, color=TRUE, shade=TRUE, 
         labels=1, lines=0, main = "Glioblastoma 100 gene K-means Cluster Plot, Outliers Removed")

Tables for Ward clusters

#Generate tables for ward clustering results

#100 gene glioma 10 cluster table

gli.100.clist <- lapply(sort(unique(gli.100.clusters)), function(x) r100.glioma[which(gli.100.clusters==x),])
gli.cluster.100.1 <- cbind(gli.100.clist[[1]], Cluster = "C1")
gli.cluster.100.2 <- cbind(gli.100.clist[[2]], Cluster = "C2")
gli.cluster.100.3 <- cbind(gli.100.clist[[3]], Cluster = "C3")
gli.cluster.100.4 <- cbind(gli.100.clist[[4]], Cluster = "C4")
gli.cluster.100.5 <- cbind(gli.100.clist[[5]], Cluster = "C5")
gli.cluster.100.6 <- cbind(gli.100.clist[[6]], Cluster = "C6")
gli.cluster.100.7 <- cbind(gli.100.clist[[7]], Cluster = "C7")
gli.cluster.100.8 <- cbind(gli.100.clist[[8]], Cluster = "C8")
gli.cluster.100.9 <- cbind(gli.100.clist[[9]], Cluster = "C9")
gli.cluster.100.10 <- cbind(gli.100.clist[[10]], Cluster = "C10")
glioma_100_Genes_C <- rbind(gli.cluster.100.1, gli.cluster.100.2, gli.cluster.100.3, gli.cluster.100.4,
                           gli.cluster.100.5,gli.cluster.100.6,gli.cluster.100.7, gli.cluster.100.8, 
                           gli.cluster.100.9, gli.cluster.100.10)

glioma_100_Genes_C[,-c(1,3:9)]
##                                                                                GeneFunction
## ABCB7                                ATP-binding cassette, sub-family B (MDR/TAP), member 7
## ADPRH                                                         ADP-ribosylarginine hydrolase
## C2                                                                   Complement component 2
## CCNL2                                                                             Cyclin L2
## CPM                                                                      Carboxypeptidase M
## DHH                                                                         Desert hedgehog
## FGR                           Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## FKBP9                                                       FK506 binding protein 9, 63 kDa
## HS6ST1                                               Heparan sulfate 6-O-sulfotransferase 1
## MALT1                       Mucosa associated lymphoid tissue lymphoma translocation gene 1
## OSCAR                                   Osteoclast associated, immunoglobulin-like receptor
## PHKB                                                             Phosphorylase kinase, beta
## RASSF10                  Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## RRN3P2           RNA polymerase I transcription factor homolog (S. cerevisiae) pseudogene 2
## SNX10                                                                      Sorting nexin 10
## SULF1                                                                           Sulfatase 1
## ZNF235                                                                 In multiple clusters
## ZNF432                                                              Zinc finger protein 432
## ZNF69                                                                Zinc finger protein 69
## ADAMTS14                         ADAM metallopeptidase with thrombospondin type 1 motif, 14
## CCDC114                                                   Coiled-coil domain containing 114
## CD1D                                                                          CD1d molecule
## CMPK1                                  Cytidine monophosphate (UMP-CMP) kinase 1, cytosolic
## KLF10                                                                Kruppel-like factor 10
## LMO7                                                                           LIM domain 7
## PPM1J                                          Protein phosphatase, Mg2+/Mn2+ dependent, 1J
## SARDH                                                               Sarcosine dehydrogenase
## TMTC1                               Transmembrane and tetratricopeptide repeat containing 1
## TNIP1                                                         TNFAIP3 interacting protein 1
## ZNF175                                                              Zinc finger protein 175
## AEBP1                                                                  AE binding protein 1
## ARMC10                                                       Armadillo repeat containing 10
## CDCP1                                                       CUB domain containing protein 1
## CNPY4                                                          Canopy 4 homolog (zebrafish)
## EPHB2                                                                       EPH receptor B2
## ERRFI1                                                   ERBB receptor feedback inhibitor 1
## GALE                                                              UDP-galactose-4-epimerase
## HMOX1                                                          Heme oxygenase (decycling) 1
## ISG20                                          Interferon stimulated exonuclease gene 20kDa
## LOXL1                                                                  Lysyl oxidase-like 1
## PODNL1                                                                       Podocan-like 1
## PTX3                                                                      Pentraxin 3, long
## SKP2                                              S-phase kinase-associated protein 2 (p45)
## SPAG4                                                            Sperm associated antigen 4
## UBE2J2                            Ubiquitin-conjugating enzyme E2, J2 (UBC6 homolog, yeast)
## ZNF320                                                              Zinc finger protein 320
## ZNF480                                                              Zinc finger protein 480
## ZNF765                                                              Zinc finger protein 765
## ALOX15B                                                Arachidonate 15-lipoxygenase, type B
## CHST1                               Carbohydrate (keratan sulfate Gal-6) sulfotransferase 1
## FIGF                     C-fos induced growth factor (vascular endothelial growth factor D)
## KPNA5                                                Karyopherin alpha 5 (importin alpha 6)
## RCOR3                                                                    REST corepressor 3
## UBE2Z                                                      Ubiquitin-conjugating enzyme E2Z
## ZNF136                                                              Zinc finger protein 136
## ANKH                                                 Ankylosis, progressive homolog (mouse)
## GUCA1B                                              Guanylate cyclase activator 1B (retina)
## INPP5F                                               Inositol polyphosphate-5-phosphatase F
## NRCAM                                                       Neuronal cell adhesion molecule
## SLC25A48                                                Solute carrier family 25, member 48
## ZBTB6                                               Zinc finger and BTB domain containing 6
## AP3B2                                     Adaptor-related protein complex 3, beta 2 subunit
## APBB1                Amyloid beta (A4) precursor protein-binding, family B, member 1 (Fe65)
## FAM149A                                       Family with sequence similarity 149, member A
## GDI2                                                           GDP dissociation inhibitor 2
## GRHPR                                        Glyoxylate reductase/hydroxypyruvate reductase
## L1CAM                                                             L1 cell adhesion molecule
## NRXN2                                                                            Neurexin 2
## OLFM1                                                                        Olfactomedin 1
## RAB3C                                                     RAB3C, member RAS oncogene family
## RRAGB                                                             Ras-related GTP binding B
## TAF9B      TAF9B RNA polymerase II, TATA box binding protein (TBP)-associated factor, 31kDa
## TTC9                                                      Tetratricopeptide repeat domain 9
## CCBL2                                                       Cysteine conjugate-beta lyase 2
## H6PD                             Hexose-6-phosphate dehydrogenase (glucose 1-dehydrogenase)
## MAN2B2                                               Mannosidase, alpha, class 2B, member 2
## MIER1                          Mesoderm induction early response 1 homolog (Xenopus laevis)
## PNPLA4                                       Patatin-like phospholipase domain containing 4
## SLC9A1                        Solute carrier family 9 (sodium/hydrogen exchanger), member 1
## EPS8L2                                                                          EPS8-like 2
## FSTL3                                            Follistatin-like 3 (secreted glycoprotein)
## NSUN5                                                      NOP2/Sun domain family, member 5
## TSEN15                                TRNA splicing endonuclease 15 homolog (S. cerevisiae)
## ZNF616                                                                 In multiple clusters
## FLJ23867                                                      Hypothetical protein FLJ23867
## HIST3H2A                                                             Histone cluster 3, H2a
## HMX1                                                                   H6 family homeobox 1
## MCM3AP-AS1                                      MCM3AP antisense RNA 1 (non-protein coding)
## PNMA5                                                         Paraneoplastic antigen like 5
## SCARB1                                                                       Data not found
## SLC9A2                        Solute carrier family 9 (sodium/hydrogen exchanger), member 2
## VWC2L                              Von Willebrand factor C domain-containing protein 2-like
## GRN                                                                                Granulin
## ID3                  Inhibitor of DNA binding 3, dominant negative helix-loop-helix protein
## ITGB5                                                                      Integrin, beta 5
## ORAI2                                    ORAI calcium release-activated calcium modulator 2
## PLD3                                                       Phospholipase D family, member 3
## RBBP4                                                      Retinoblastoma binding protein 4
## SERBP1                                                      SERPINE1 mRNA binding protein 1
## SHISA5                                                     Shisa homolog 5 (Xenopus laevis)
##            Cluster
## ABCB7           C1
## ADPRH           C1
## C2              C1
## CCNL2           C1
## CPM             C1
## DHH             C1
## FGR             C1
## FKBP9           C1
## HS6ST1          C1
## MALT1           C1
## OSCAR           C1
## PHKB            C1
## RASSF10         C1
## RRN3P2          C1
## SNX10           C1
## SULF1           C1
## ZNF235          C1
## ZNF432          C1
## ZNF69           C1
## ADAMTS14        C2
## CCDC114         C2
## CD1D            C2
## CMPK1           C2
## KLF10           C2
## LMO7            C2
## PPM1J           C2
## SARDH           C2
## TMTC1           C2
## TNIP1           C2
## ZNF175          C2
## AEBP1           C3
## ARMC10          C3
## CDCP1           C3
## CNPY4           C3
## EPHB2           C3
## ERRFI1          C3
## GALE            C3
## HMOX1           C3
## ISG20           C3
## LOXL1           C3
## PODNL1          C3
## PTX3            C3
## SKP2            C3
## SPAG4           C3
## UBE2J2          C3
## ZNF320          C3
## ZNF480          C3
## ZNF765          C3
## ALOX15B         C4
## CHST1           C4
## FIGF            C4
## KPNA5           C4
## RCOR3           C4
## UBE2Z           C4
## ZNF136          C4
## ANKH            C5
## GUCA1B          C5
## INPP5F          C5
## NRCAM           C5
## SLC25A48        C5
## ZBTB6           C5
## AP3B2           C6
## APBB1           C6
## FAM149A         C6
## GDI2            C6
## GRHPR           C6
## L1CAM           C6
## NRXN2           C6
## OLFM1           C6
## RAB3C           C6
## RRAGB           C6
## TAF9B           C6
## TTC9            C6
## CCBL2           C7
## H6PD            C7
## MAN2B2          C7
## MIER1           C7
## PNPLA4          C7
## SLC9A1          C7
## EPS8L2          C8
## FSTL3           C8
## NSUN5           C8
## TSEN15          C8
## ZNF616          C8
## FLJ23867        C9
## HIST3H2A        C9
## HMX1            C9
## MCM3AP-AS1      C9
## PNMA5           C9
## SCARB1          C9
## SLC9A2          C9
## VWC2L           C9
## GRN            C10
## ID3            C10
## ITGB5          C10
## ORAI2          C10
## PLD3           C10
## RBBP4          C10
## SERBP1         C10
## SHISA5         C10
#100 gene glioblastoma 10 cluster table
gbm.100.clist <- lapply(sort(unique(gbm.100.clusters)), function(x) r100.glioblastoma[which(gbm.100.clusters==x),])
gbm.cluster.100.1 <- cbind(gbm.100.clist[[1]], Cluster = "C1")
gbm.cluster.100.2 <- cbind(gbm.100.clist[[2]], Cluster = "C2")
gbm.cluster.100.3 <- cbind(gbm.100.clist[[3]], Cluster = "C3")
gbm.cluster.100.4 <- cbind(gbm.100.clist[[4]], Cluster = "C4")
gbm.cluster.100.5 <- cbind(gbm.100.clist[[5]], Cluster = "C5")
gbm.cluster.100.6 <- cbind(gbm.100.clist[[6]], Cluster = "C6")
gbm.cluster.100.7 <- cbind(gbm.100.clist[[7]], Cluster = "C7")
gbm.cluster.100.8 <- cbind(gbm.100.clist[[8]], Cluster = "C8")
gbm.cluster.100.9 <- cbind(gbm.100.clist[[9]], Cluster = "C9")
gbm.cluster.100.10 <- cbind(gbm.100.clist[[10]], Cluster = "C10")
glioblastoma_100_Genes_C <- rbind(gbm.cluster.100.1, gbm.cluster.100.2, gbm.cluster.100.3, gbm.cluster.100.4,
                            gbm.cluster.100.5,gbm.cluster.100.6,gbm.cluster.100.7, gbm.cluster.100.8, 
                            gbm.cluster.100.9, gbm.cluster.100.10)

glioblastoma_100_Genes_C[,-c(1,3:9)]
##                                                                                GeneFunction
## SKP2                                              S-phase kinase-associated protein 2 (p45)
## ZNF616                                                                 In multiple clusters
## TAF9B      TAF9B RNA polymerase II, TATA box binding protein (TBP)-associated factor, 31kDa
## GUCA1B                                              Guanylate cyclase activator 1B (retina)
## SERBP1                                                      SERPINE1 mRNA binding protein 1
## SLC25A48                                                Solute carrier family 25, member 48
## TSEN15                                TRNA splicing endonuclease 15 homolog (S. cerevisiae)
## ZNF480                                                              Zinc finger protein 480
## ZNF320                                                              Zinc finger protein 320
## ZNF175                                                              Zinc finger protein 175
## ID3                  Inhibitor of DNA binding 3, dominant negative helix-loop-helix protein
## CMPK1                                  Cytidine monophosphate (UMP-CMP) kinase 1, cytosolic
## ZNF69                                                                Zinc finger protein 69
## RBBP4                                                      Retinoblastoma binding protein 4
## ZNF235                                                                 In multiple clusters
## CCNL2                                                                             Cyclin L2
## UBE2J2                            Ubiquitin-conjugating enzyme E2, J2 (UBC6 homolog, yeast)
## HIST3H2A                                                             Histone cluster 3, H2a
## ZNF765                                                              Zinc finger protein 765
## MCM3AP-AS1                                      MCM3AP antisense RNA 1 (non-protein coding)
## PLD3                                                       Phospholipase D family, member 3
## PPM1J                                          Protein phosphatase, Mg2+/Mn2+ dependent, 1J
## NRCAM                                                       Neuronal cell adhesion molecule
## KLF10                                                                Kruppel-like factor 10
## AEBP1                                                                  AE binding protein 1
## ARMC10                                                       Armadillo repeat containing 10
## AP3B2                                     Adaptor-related protein complex 3, beta 2 subunit
## SCARB1                                                                       Data not found
## OLFM1                                                                        Olfactomedin 1
## GRN                                                                                Granulin
## ORAI2                                    ORAI calcium release-activated calcium modulator 2
## FKBP9                                                       FK506 binding protein 9, 63 kDa
## GALE                                                              UDP-galactose-4-epimerase
## ITGB5                                                                      Integrin, beta 5
## CNPY4                                                          Canopy 4 homolog (zebrafish)
## ISG20                                          Interferon stimulated exonuclease gene 20kDa
## NSUN5                                                      NOP2/Sun domain family, member 5
## ERRFI1                                                   ERBB receptor feedback inhibitor 1
## SARDH                                                               Sarcosine dehydrogenase
## PODNL1                                                                       Podocan-like 1
## ADAMTS14                         ADAM metallopeptidase with thrombospondin type 1 motif, 14
## APBB1                Amyloid beta (A4) precursor protein-binding, family B, member 1 (Fe65)
## OSCAR                                   Osteoclast associated, immunoglobulin-like receptor
## CCDC114                                                   Coiled-coil domain containing 114
## EPHB2                                                                       EPH receptor B2
## HS6ST1                                               Heparan sulfate 6-O-sulfotransferase 1
## VWC2L                              Von Willebrand factor C domain-containing protein 2-like
## PNMA5                                                         Paraneoplastic antigen like 5
## CHST1                               Carbohydrate (keratan sulfate Gal-6) sulfotransferase 1
## SHISA5                                                     Shisa homolog 5 (Xenopus laevis)
## FIGF                     C-fos induced growth factor (vascular endothelial growth factor D)
## UBE2Z                                                      Ubiquitin-conjugating enzyme E2Z
## RASSF10                  Ras association (RalGDS/AF-6) domain family (N-terminal) member 10
## SLC9A2                        Solute carrier family 9 (sodium/hydrogen exchanger), member 2
## C2                                                                   Complement component 2
## FGR                           Gardner-Rasheed feline sarcoma viral (v-fgr) oncogene homolog
## MAN2B2                                               Mannosidase, alpha, class 2B, member 2
## MALT1                       Mucosa associated lymphoid tissue lymphoma translocation gene 1
## CPM                                                                      Carboxypeptidase M
## FAM149A                                       Family with sequence similarity 149, member A
## PNPLA4                                       Patatin-like phospholipase domain containing 4
## RAB3C                                                     RAB3C, member RAS oncogene family
## L1CAM                                                             L1 cell adhesion molecule
## TMTC1                               Transmembrane and tetratricopeptide repeat containing 1
## LMO7                                                                           LIM domain 7
## ANKH                                                 Ankylosis, progressive homolog (mouse)
## FLJ23867                                                      Hypothetical protein FLJ23867
## LOXL1                                                                  Lysyl oxidase-like 1
## HMX1                                                                   H6 family homeobox 1
## INPP5F                                               Inositol polyphosphate-5-phosphatase F
## H6PD                             Hexose-6-phosphate dehydrogenase (glucose 1-dehydrogenase)
## NRXN2                                                                            Neurexin 2
## CD1D                                                                          CD1d molecule
## TTC9                                                      Tetratricopeptide repeat domain 9
## SULF1                                                                           Sulfatase 1
## DHH                                                                         Desert hedgehog
## ADPRH                                                         ADP-ribosylarginine hydrolase
## SLC9A1                        Solute carrier family 9 (sodium/hydrogen exchanger), member 1
## ALOX15B                                                Arachidonate 15-lipoxygenase, type B
## PTX3                                                                      Pentraxin 3, long
## CDCP1                                                       CUB domain containing protein 1
## SNX10                                                                      Sorting nexin 10
## FSTL3                                            Follistatin-like 3 (secreted glycoprotein)
## SPAG4                                                            Sperm associated antigen 4
## HMOX1                                                          Heme oxygenase (decycling) 1
## TNIP1                                                         TNFAIP3 interacting protein 1
## RRN3P2           RNA polymerase I transcription factor homolog (S. cerevisiae) pseudogene 2
## EPS8L2                                                                          EPS8-like 2
## MIER1                          Mesoderm induction early response 1 homolog (Xenopus laevis)
## ZNF136                                                              Zinc finger protein 136
## ZBTB6                                               Zinc finger and BTB domain containing 6
## CCBL2                                                       Cysteine conjugate-beta lyase 2
## ABCB7                                ATP-binding cassette, sub-family B (MDR/TAP), member 7
## ZNF432                                                              Zinc finger protein 432
## RCOR3                                                                    REST corepressor 3
## GRHPR                                        Glyoxylate reductase/hydroxypyruvate reductase
## GDI2                                                           GDP dissociation inhibitor 2
## RRAGB                                                             Ras-related GTP binding B
## KPNA5                                                Karyopherin alpha 5 (importin alpha 6)
## PHKB                                                             Phosphorylase kinase, beta
##            Cluster
## SKP2            C1
## ZNF616          C1
## TAF9B           C1
## GUCA1B          C1
## SERBP1          C1
## SLC25A48        C1
## TSEN15          C1
## ZNF480          C1
## ZNF320          C1
## ZNF175          C1
## ID3             C1
## CMPK1           C1
## ZNF69           C1
## RBBP4           C1
## ZNF235          C1
## CCNL2           C1
## UBE2J2          C1
## HIST3H2A        C1
## ZNF765          C1
## MCM3AP-AS1      C1
## PLD3            C2
## PPM1J           C2
## NRCAM           C2
## KLF10           C2
## AEBP1           C2
## ARMC10          C2
## AP3B2           C2
## SCARB1          C2
## OLFM1           C2
## GRN             C2
## ORAI2           C2
## FKBP9           C2
## GALE            C2
## ITGB5           C3
## CNPY4           C3
## ISG20           C3
## NSUN5           C3
## ERRFI1          C3
## SARDH           C3
## PODNL1          C3
## ADAMTS14        C4
## APBB1           C4
## OSCAR           C4
## CCDC114         C4
## EPHB2           C4
## HS6ST1          C4
## VWC2L           C4
## PNMA5           C4
## CHST1           C4
## SHISA5          C4
## FIGF            C4
## UBE2Z           C4
## RASSF10         C4
## SLC9A2          C4
## C2              C5
## FGR             C5
## MAN2B2          C5
## MALT1           C5
## CPM             C5
## FAM149A         C5
## PNPLA4          C5
## RAB3C           C6
## L1CAM           C6
## TMTC1           C6
## LMO7            C6
## ANKH            C6
## FLJ23867        C6
## LOXL1           C6
## HMX1            C7
## INPP5F          C7
## H6PD            C7
## NRXN2           C7
## CD1D            C7
## TTC9            C7
## SULF1           C7
## DHH             C7
## ADPRH           C7
## SLC9A1          C7
## ALOX15B         C7
## PTX3            C8
## CDCP1           C8
## SNX10           C8
## FSTL3           C8
## SPAG4           C8
## HMOX1           C8
## TNIP1           C8
## RRN3P2          C8
## EPS8L2          C8
## MIER1           C9
## ZNF136          C9
## ZBTB6           C9
## CCBL2           C9
## ABCB7           C9
## ZNF432          C9
## RCOR3           C9
## GRHPR          C10
## GDI2           C10
## RRAGB          C10
## KPNA5          C10
## PHKB           C10
#save cluster table results to csvs 
write.csv(glioblastoma_100_Genes_C, "Table_Glioblastoma_100genes_10WClust.csv")
write.csv(glioma_100_Genes_C, "Table_Glioma_100genes_10WClust.csv")

K-means and PAM clustering of Glioma and Glioblastoma Genes

library(ggplot2)
library(ggfortify)
#Create wssplot function to help determine ideal number of clusters based on data
wssplot <- function(data, nc=20, seed=1234){
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

#K-means clustering of glioblastoma dataset
head(glioblastoma)
##     GeneName                                             GeneFunction
## 39     ABCB7   ATP-binding cassette, sub-family B (MDR/TAP), member 7
## 62     ABCG2     ATP-binding cassette, sub-family G (WHITE), member 2
## 105    ACADM       Acyl-CoA dehydrogenase, C-4 to C-12 straight chain
## 110    ACAP1 ArfGAP with coiled-coil, ankyrin repeat and PH domains 1
## 138    ACOT7                                  Acyl-CoA thioesterase 7
## 174   ACTL6A                                            Actin-like 6A
##     Unweighted_meta_Z_allcancers   precog_Z Cox coefficient     pvalue
## 39                     0.4043703 -1.1253262      -1.8190200 -1.4916750
## 62                    -2.5096316 -2.0565740      -1.6583421 -1.0904220
## 105                   -1.2344034 -0.6418212      -1.3821405  0.5990641
## 110                   -0.8348805  0.2043125       0.5984089 -0.3160742
## 138                    1.2878316 -0.9980128       0.8907267 -1.6113469
## 174                    2.9892534  0.6034188      -1.4062422  1.0918309
##      bh_pvalue Median Expression Mean Expression
## 39  -1.2726904       -0.16881740     -0.17033129
## 62  -0.5376288       -0.09576952     -0.09488017
## 105  0.4737663       -0.01578161     -0.04205819
## 110  0.4012574       -0.26797451     -0.25152557
## 138 -2.2640969       -0.10571474     -0.09930900
## 174  0.4871126       -0.05723424     -0.06414104
row.names(glioblastoma) <- glioblastoma$GeneName
df.km.glioblastoma <- glioblastoma[-c(1:2)]
colnames(df.km.glioblastoma)<- c("unweighted_meta_z_all", "precog_Z", "cox_coefficient","pvalue", "bh_pvalue", "median_expression", "mean_expresson")
wssplot(df.km.glioblastoma)      

library(NbClust)

nc.km.glioblastoma <- NbClust(df.km.glioblastoma, min.nc=2, max.nc=25, method="kmeans")
## Warning in pf(beale, pp, df2): NaNs produced

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 9 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 1 proposed 15 as the best number of clusters 
## * 1 proposed 18 as the best number of clusters 
## * 2 proposed 25 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4 
##  
##  
## *******************************************************************
par(opar)
table(nc.km.glioblastoma$Best.n[1,])
## 
##  0  2  3  4  5  8 15 18 25 
##  2  7  2  9  1  1  1  1  2
barplot(table(nc.km.glioblastoma$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of Clusters of 512 glioblastoma Genes Chosen by 7 Criteria") 

library(cluster)
fit.km.glioblastoma <- kmeans(df.km.glioblastoma, 4, nstart=25) 
fit.km.glioblastoma$size
## [1] 102 266   2 142
fit.km.glioblastoma$centers                                               
##   unweighted_meta_z_all   precog_Z cox_coefficient     pvalue  bh_pvalue
## 1          -0.008409492  0.2306924       0.6451177 -1.4233663 -1.5402096
## 2          -0.061382418  0.1910698       0.5391439  0.4569504  0.4370407
## 3           0.271743538  0.8033295      -0.3559794  0.5990641  0.4752023
## 4           0.117197214 -0.5349426      -1.4683263  0.1580031  0.2809728
##   median_expression mean_expresson
## 1       -0.08335195    -0.08183629
## 2       -0.03740194    -0.03586408
## 3       13.40534252    14.16909386
## 4       -0.05887231    -0.07359887
aggregate(glioblastoma[-c(1:2)], by=list(cluster=fit.km.glioblastoma$cluster), mean)
##   cluster Unweighted_meta_Z_allcancers   precog_Z Cox coefficient
## 1       1                 -0.008409492  0.2306924       0.6451177
## 2       2                 -0.061382418  0.1910698       0.5391439
## 3       3                  0.271743538  0.8033295      -0.3559794
## 4       4                  0.117197214 -0.5349426      -1.4683263
##       pvalue  bh_pvalue Median Expression Mean Expression
## 1 -1.4233663 -1.5402096       -0.08335195     -0.08183629
## 2  0.4569504  0.4370407       -0.03740194     -0.03586408
## 3  0.5990641  0.4752023       13.40534252     14.16909386
## 4  0.1580031  0.2809728       -0.05887231     -0.07359887
clusplot(df.km.glioblastoma, fit.km.glioblastoma$cluster, color=TRUE, shade=TRUE, 
         labels=1, lines=0, main = "Glioblastoma K-means Cluster Plot")

#w/ ggplot
autoplot(kmeans(df.km.glioblastoma, 4), data = df.km.glioblastoma, frame = TRUE, main = "K-means Clustering of Glioblastoma Genes")

df.km <- fortify(kmeans(df.km.glioblastoma, 4), data = df.km.glioblastoma)
ggplot(df.km, aes(x= cluster, fill = cluster)) + geom_bar()

#K-means affected by outlier, check if PAM method better
set.seed(1234)
fit.pam.glioblastoma <- pam(df.km.glioblastoma, k = 4, stand=TRUE)
fit.pam.glioblastoma$medoids
##        unweighted_meta_z_all    precog_Z cox_coefficient      pvalue
## RPL7L1             0.1113523 -0.27490099      -1.4120014  0.03590205
## CHST1             -0.1604180  0.16282839       0.5388807  0.24708782
## NRSN1             -0.1626039 -0.07105643       0.8844351 -1.40720070
## EEF1A1            -1.6740676 -0.81347979      -1.3181113  1.51420242
##         bh_pvalue median_expression mean_expresson
## RPL7L1  0.4012574       -0.08261977     -0.1073970
## CHST1   0.4012574       -0.18954282     -0.1714919
## NRSN1  -1.1270882       -0.23836226     -0.2102697
## EEF1A1  0.5491473       16.09522707     14.6013264
clusplot(fit.pam.glioblastoma,color=TRUE, shade=TRUE, 
         labels=1, lines=0, main = "Glioblastoma PAM Cluster Plot")

#w/ggplot
autoplot(pam(df.km.glioblastoma, 4), data = df.km.glioblastoma, frame = TRUE, main = "PAM Clustering of Glioblastoma Genes")

df.pam <- fortify(pam(df.km.glioblastoma, 4), data = df.km.glioblastoma)
ggplot(df.pam, aes(x= cluster, fill = cluster)) + geom_bar()

#Pam method isn't better, need to remove outliers

Remove outlier and re-calculate how many clusters would be ideal

Removing the outliers greatly improved the clustering results. However, outliers might still have valuable information and should be evaluated further.

#Need to remove outliers
library(outliers)
## 
## Attaching package: 'outliers'
## The following object is masked from 'package:psych':
## 
##     outlier
outlier(df.km.glioblastoma)
## unweighted_meta_z_all              precog_Z       cox_coefficient 
##              4.050315             -3.311827             -2.306378 
##                pvalue             bh_pvalue     median_expression 
##             -1.794192             -9.757258             16.095227 
##        mean_expresson 
##             14.601326
outlier(df.km.glioblastoma, opposite = TRUE)
## unweighted_meta_z_all              precog_Z       cox_coefficient 
##            -3.4701961             2.5853601             2.0745167 
##                pvalue             bh_pvalue     median_expression 
##             1.6549929             0.6062377            -0.2742224 
##        mean_expresson 
##            -0.2580407
rm.gbm.outliers <- c("EEF1A1", "SPP1", "PTPRN", "ZIC3", "PTPRN2", "CCT3","ENO1", "CTSB", "MTHFD2", "FUCA1", 
                     "GNB2L1", "GDI2", "APLP2", "HSP90B1", "IGFBP2","NRCAM", "AEBP1", "HTRA1", "GSN")
tno.pam.glioblastoma<- df.km.glioblastoma[!rownames(df.km.glioblastoma) %in% rm.gbm.outliers,]

no.pam.glioblastoma<-scale(tno.pam.glioblastoma)*10

wssplot(no.pam.glioblastoma)

nbo.km.glioblastoma <- NbClust(tno.pam.glioblastoma, min.nc=2, max.nc=25, method="kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 10 proposed 3 as the best number of clusters 
## * 1 proposed 4 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 1 proposed 16 as the best number of clusters 
## * 1 proposed 18 as the best number of clusters 
## * 1 proposed 21 as the best number of clusters 
## * 1 proposed 22 as the best number of clusters 
## * 2 proposed 25 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
par(opar)
table(nbo.km.glioblastoma$Best.n[1,])
## 
##  0  1  2  3  4  6 16 18 21 22 25 
##  2  1  4 10  1  2  1  1  1  1  2
barplot(table(nbo.km.glioblastoma$Best.n[1,]), 
        xlab="Numer of Clusters", ylab="Number of Criteria",
        main="Number of Clusters of GBM Genes Chosen by 7 Criteria with Outliers Removed") 

Plot K-means and Pam Clustering with Outliers Removed

#K means w/o outliers
no.km.glioblastoma <- kmeans(no.pam.glioblastoma, 3, iter.max = 16, nstart=25) 
no.km.glioblastoma$size
## [1]  52 109 332
no.km.glioblastoma$centers                                               
##   unweighted_meta_z_all   precog_Z cox_coefficient     pvalue  bh_pvalue
## 1             4.2827092  0.9371114       0.2617886   2.072610   2.608026
## 2            -0.1383413  0.6537113       2.7556122 -14.039213 -15.783358
## 3            -0.6253665 -0.3613986      -0.9457071   4.284634   4.773399
##   median_expression mean_expresson
## 1         23.610877      23.983115
## 2         -1.860956      -1.931962
## 3         -3.087113      -3.122103
aggregate(no.pam.glioblastoma, by=list(cluster=no.km.glioblastoma$cluster), mean)
##   cluster unweighted_meta_z_all   precog_Z cox_coefficient     pvalue
## 1       1             4.2827092  0.9371114       0.2617886   2.072610
## 2       2            -0.1383413  0.6537113       2.7556122 -14.039213
## 3       3            -0.6253665 -0.3613986      -0.9457071   4.284634
##    bh_pvalue median_expression mean_expresson
## 1   2.608026         23.610877      23.983115
## 2 -15.783358         -1.860956      -1.931962
## 3   4.773399         -3.087113      -3.122103
str(no.km.glioblastoma)
## List of 9
##  $ cluster     : Named int [1:493] 2 2 3 3 2 3 3 3 3 3 ...
##   ..- attr(*, "names")= chr [1:493] "ABCB7" "ABCG2" "ACADM" "ACAP1" ...
##  $ centers     : num [1:3, 1:7] 4.283 -0.138 -0.625 0.937 0.654 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "1" "2" "3"
##   .. ..$ : chr [1:7] "unweighted_meta_z_all" "precog_Z" "cox_coefficient" "pvalue" ...
##  $ totss       : num 344400
##  $ withinss    : num [1:3] 36933 54762 121399
##  $ tot.withinss: num 213093
##  $ betweenss   : num 131307
##  $ size        : int [1:3] 52 109 332
##  $ iter        : int 3
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
clusplot(no.pam.glioblastoma, no.km.glioblastoma$cluster, color=TRUE, shade=TRUE, 
         labels=1, lines=0, main = "Glioblastoma K-means Cluster Plot, Outliers Removed")

#W/ ggplot
autoplot(kmeans(tno.pam.glioblastoma, 4), data =tno.pam.glioblastoma, frame = TRUE, main = "K-means Clustering of Glioblastoma Genes")

no.df.km <- fortify(kmeans(tno.pam.glioblastoma, 4), data = tno.pam.glioblastoma)
ggplot(no.df.km, aes(x= cluster, fill = cluster)) + geom_bar()

#PAM plot w/o outliers
gg.pam.glioblastoma<-tno.pam.glioblastoma
colnames(gg.pam.glioblastoma)<- c("unweighted_meta_z_all", "precog_Z", "cox_coefficient","pvalue", "bh_pvalue", "median_expression", "mean_expresson")
autoplot(pam(gg.pam.glioblastoma, 3), data = gg.pam.glioblastoma, frame = TRUE, main = "PAM Clustering of Glioblastoma Genes, Outliers Removed")

no.df.km <- fortify(pam(no.pam.glioblastoma, 3), data = no.pam.glioblastoma)
ggplot(no.df.km, aes(x= cluster, fill = cluster)) + geom_bar()

##Visualizing Distance Matrices (STHDA)

A simple solution for visualizing the distance matrices is to use the function corrplot() [in corrplot package]. Here I continued my exploration of clustering analysis visualization by applying the 50 glioblastoma and glioma data to visual the eculidan distances clustering with heat maps.

library("corrplot")

#Use corrplot to view distance matrices

#Glioblastoma
d.r50.gbm.eucl<- dist(nc.r50.glioblastoma, method = "euclidean")
corrplot(as.matrix(d.r50.gbm.eucl), is.corr = FALSE, method = "color")

par(opar)
#Glioma
d.r50.gli.eucl<- dist(nc.r50.glioma, method = "euclidean")
corrplot(as.matrix(d.r50.gli.eucl), is.corr = FALSE, method = "color")

par(opar)

#Use heatmap w/ dendogram
#Glioblastoma
heatmap(as.matrix(d.r50.gbm.eucl), symm = TRUE, distfun = function(x) as.dist(x))

par(opar)
#Glioma
heatmap(as.matrix(d.r50.gli.eucl), symm = TRUE, distfun = function(x) as.dist(x))

par(opar)

Visualing Distance Matrices using factoextra R package:

I explored the factoextra clustering package. The function get_dist() is used for computing a distance matrix between the rows of a data matrix. Compared to the standard dist() function used in previous methods, get_dist() supports correlation-based distance measures including “pearson”, “kendall”, and “spearman”. The function fviz_dist() creats a heatmap like plot for visualizing the distance matrix.

I used the 50 glioma gene dataset to see how different correlation-based distance measured varried over the same data.

library(factoextra)
#50 Glioma genes using Pearson Method
p.res.dist <- get_dist(nc.r50.glioma, stand = TRUE, method = "pearson")
fviz_dist(p.res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

par(opar)
#50 Glioma genes using Kendall Method
k.res.dist <- get_dist(nc.r50.glioma, stand = TRUE, method = "kendall")
fviz_dist(k.res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

par(opar)
#50 Glioma genes using Spearman Method
s.res.dist <- get_dist(nc.r50.glioma, stand = TRUE, method = "spearman")
fviz_dist(s.res.dist, 
   gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

par(opar)

K-means Clustering using Factoextra Package

The factoextra package also has a function like nbclust to help determine the optimal number of clusters for a data set. I’ve tested in on the 50 glioma gene dataset below. Unlike the nbclust function used earlier, the fviz_nbclust function suggested 4 clusters instead of 3.

fviz_nbclust(nc.r50.glioma, kmeans, method = "gap_stat")

km.res <- kmeans(nc.r50.glioma, 4, nstart = 25)
# Visualize
library("factoextra")
fviz_cluster(km.res, data = nc.r50.glioma, geom = "point", show.clust.cent = TRUE, ellipse.type = "convex")+
  theme_minimal()